Load data

library(readr)
brands <- read.csv(file="BelkinElagoComplete.csv", header = TRUE, sep = ";")
head(brands, 10)
##      salary age elevel car zipcode credit  brand
## 1  123476.6  23      4   1       0 779.56  Elago
## 2  120274.6  22      4   1       2 784.70  Elago
## 3  121735.5  26      4   1       1 749.35  Elago
## 4  138276.8  23      4   1       0 743.85  Elago
## 5  126869.2  23      4   1       0 759.03  Elago
## 6  130595.1  22      4   1       7 774.13  Elago
## 7  121358.0  80      4   1       2 732.03 Belkin
## 8  127457.4  71      4   1       5 744.29 Belkin
## 9  137670.5  75      4   1       0 815.00 Belkin
## 10 120088.7  59      4   1       7 740.48 Belkin

Load Libraries

library(funModeling) 
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## funModeling v.1.9.3 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai
library(tidyverse) 
## ── Attaching packages ────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.1     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ purrr   0.3.2     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
library(Hmisc)

Initial exploration 1

Number of observations (rows) and variables, and a head of the first cases

glimpse(brands)
## Observations: 10,000
## Variables: 7
## $ salary  <dbl> 123476.6, 120274.6, 121735.5, 138276.8, 126869.2, 130595…
## $ age     <int> 23, 22, 26, 23, 23, 22, 80, 71, 75, 59, 73, 59, 77, 58, …
## $ elevel  <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ car     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ zipcode <int> 0, 2, 1, 0, 0, 7, 2, 5, 0, 7, 6, 6, 2, 1, 5, 3, 5, 7, 8,…
## $ credit  <dbl> 779.56, 784.70, 749.35, 743.85, 759.03, 774.13, 732.03, …
## $ brand   <fct> Elago, Elago, Elago, Elago, Elago, Elago, Belkin, Belkin…

Initial exploration 2

Checking missing values, zeros, data type, and unique values Probably one of the first steps, when we get a new dataset to analyze, is to know if there are missing values (NA in R) and the data type.

The df_status function coming in funModeling can help us by showing these numbers in relative and percentage values. It also retrieves the infinite and zeros statistics.

df_status(brands)
##   variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1   salary       0    0.00    0    0     0     0 numeric   9757
## 2      age       0    0.00    0    0     0     0 integer     61
## 3   elevel       0    0.00    0    0     0     0 integer      4
## 4      car       0    0.00    0    0     0     0 integer     20
## 5  zipcode    1097   10.97    0    0     0     0 integer      9
## 6   credit       0    0.00    0    0     0     0 numeric   8651
## 7    brand       0    0.00    0    0     0     0  factor      2

We see that there are 1097 examples or 11% of the variable zipcode with the zero value. As we know that it is regular zip code value it is fine and we want to transform this variable to categorical. We see that the variable elevel should also be transformed to categorical. Let’s do that now:

brands$elevel <- as.factor(brands$elevel)
brands$zipcode <- as.factor(brands$zipcode)
df_status(brands)
##   variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1   salary       0    0.00    0    0     0     0 numeric   9757
## 2      age       0    0.00    0    0     0     0 integer     61
## 3   elevel       0    0.00    0    0     0     0  factor      4
## 4      car       0    0.00    0    0     0     0 integer     20
## 5  zipcode    1097   10.97    0    0     0     0  factor      9
## 6   credit       0    0.00    0    0     0     0 numeric   8651
## 7    brand       0    0.00    0    0     0     0  factor      2

Analyzing categorical variables

Now that we have converted our variables to factors, we can run the function freq() that runs for all factor variables. We get some nice plots and tables as the result.

freq(brands)

##   elevel frequency percentage cumulative_perc
## 1      2      4404      44.04           44.04
## 2      3      3497      34.97           79.01
## 3      1      1435      14.35           93.36
## 4      4       664       6.64          100.00

##   zipcode frequency percentage cumulative_perc
## 1       6      1167      11.67           11.67
## 2       8      1144      11.44           23.11
## 3       2      1125      11.25           34.36
## 4       5      1113      11.13           45.49
## 5       4      1103      11.03           56.52
## 6       0      1097      10.97           67.49
## 7       3      1094      10.94           78.43
## 8       7      1091      10.91           89.34
## 9       1      1066      10.66          100.00

##    brand frequency percentage cumulative_perc
## 1  Elago      5348      53.48           53.48
## 2 Belkin      4652      46.52          100.00
## [1] "Variables processed: elevel, zipcode, brand"

Analyzing numerical variables

We will see: plot_num and profiling_num. Both run automatically for all numerical/integer variables:

plot_num(brands)

profiling_num(brands)
##   variable       mean      std_dev variation_coef       p_01      p_05
## 1   salary 84897.2509 37709.806791      0.4441817 20000.0000 25921.608
## 2      age    49.8115    17.596322      0.3532582    20.0000    22.000
## 3      car    10.4728     6.023661      0.5751719     1.0000     1.000
## 4   credit   632.0648    91.371581      0.1445605   445.4995   482.457
##         p_25     p_50        p_75        p_95       p_99     skewness
## 1 52109.2827 84968.94 117168.2605 143297.0010 150000.000 -0.006945249
## 2    35.0000    50.00     65.0000     77.0000     80.000  0.008270035
## 3     5.0000    10.00     16.0000     20.0000     20.000  0.058502642
## 4   563.3825   632.07    701.2975    783.5315    815.005  0.002483984
##   kurtosis       iqr            range_98                     range_80
## 1 1.814148 65058.978     [20000, 150000] [32480.123939, 137027.67826]
## 2 1.800191    30.000            [20, 80]                     [26, 74]
## 3 1.718976    11.000             [1, 20]                      [3, 19]
## 4 2.235649   137.915 [445.4995, 815.005]           [508.999, 757.386]

Analyzing numerical and categorical at the same time

describe from Hmisc package

library(Hmisc)
describe(brands)
## brands 
## 
##  7  Variables      10000  Observations
## ---------------------------------------------------------------------------
## salary 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    10000        0     9757        1    84897    43544    25922    32480 
##      .25      .50      .75      .90      .95 
##    52109    84969   117168   137028   143297 
## 
## lowest :  20000.00  20105.59  20140.31  20141.14  20218.82
## highest: 149919.44 149964.88 149973.46 149994.21 150000.00
## ---------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    10000        0       61        1    49.81    20.32       22       26 
##      .25      .50      .75      .90      .95 
##       35       50       65       74       77 
## 
## lowest : 20 21 22 23 24, highest: 76 77 78 79 80
## ---------------------------------------------------------------------------
## elevel 
##        n  missing distinct 
##    10000        0        4 
##                                   
## Value          1     2     3     4
## Frequency   1435  4404  3497   664
## Proportion 0.144 0.440 0.350 0.066
## ---------------------------------------------------------------------------
## car 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    10000        0       20    0.997    10.47    6.937        1        3 
##      .25      .50      .75      .90      .95 
##        5       10       16       19       20 
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency    634   280   841   420   445   395   722   749   264   429
## Proportion 0.063 0.028 0.084 0.042 0.044 0.040 0.072 0.075 0.026 0.043
##                                                                       
## Value         11    12    13    14    15    16    17    18    19    20
## Frequency    312   423   443   423   636   409   420   533   423   799
## Proportion 0.031 0.042 0.044 0.042 0.064 0.041 0.042 0.053 0.042 0.080
## ---------------------------------------------------------------------------
## zipcode 
##        n  missing distinct 
##    10000        0        9 
##                                                                 
## Value          0     1     2     3     4     5     6     7     8
## Frequency   1097  1066  1125  1094  1103  1113  1167  1091  1144
## Proportion 0.110 0.107 0.112 0.109 0.110 0.111 0.117 0.109 0.114
## ---------------------------------------------------------------------------
## credit 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    10000        0     8651        1    632.1    104.9    482.5    509.0 
##      .25      .50      .75      .90      .95 
##    563.4    632.1    701.3    757.4    783.5 
## 
## lowest : 416.59 419.73 421.85 422.08 422.59, highest: 846.38 847.65 847.73 848.56 849.00
## ---------------------------------------------------------------------------
## brand 
##        n  missing distinct 
##    10000        0        2 
##                         
## Value      Belkin  Elago
## Frequency    4652   5348
## Proportion  0.465  0.535
## ---------------------------------------------------------------------------

We should pay special attention to variable car. It is just not realistic that our customers have on average 10 cars or that the most of the customers have 18 cars or more. It would be more realistic that we are dealing here with codes for car brands, so maybe we should convert the variable cars to factor.

brands$car <- as.factor(brands$car)

freq(brands$car)

##    var frequency percentage cumulative_perc
## 1    3       841       8.41            8.41
## 2   20       799       7.99           16.40
## 3    8       749       7.49           23.89
## 4    7       722       7.22           31.11
## 5   15       636       6.36           37.47
## 6    1       634       6.34           43.81
## 7   18       533       5.33           49.14
## 8    5       445       4.45           53.59
## 9   13       443       4.43           58.02
## 10  10       429       4.29           62.31
## 11  12       423       4.23           66.54
## 12  14       423       4.23           70.77
## 13  19       423       4.23           75.00
## 14   4       420       4.20           79.20
## 15  17       420       4.20           83.40
## 16  16       409       4.09           87.49
## 17   6       395       3.95           91.44
## 18  11       312       3.12           94.56
## 19   2       280       2.80           97.36
## 20   9       264       2.64          100.00

Let’s do it all over again for practice using ggplot2 library

library(ggplot2)
str(brands)
## 'data.frame':    10000 obs. of  7 variables:
##  $ salary : num  123477 120275 121736 138277 126869 ...
##  $ age    : int  23 22 26 23 23 22 80 71 75 59 ...
##  $ elevel : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ car    : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 1 3 2 1 1 8 3 6 1 8 ...
##  $ credit : num  780 785 749 744 759 ...
##  $ brand  : Factor w/ 2 levels "Belkin","Elago": 2 2 2 2 2 2 1 1 1 1 ...

Create histograms for numerical variables

salary_hist <- ggplot(brands, aes(x=brands$salary)) + geom_histogram(bins=10) + theme_classic()
age_hist <- ggplot(brands, aes(x=brands$age)) + geom_histogram(bins=12) + theme_classic()
credit_hist <- ggplot(brands, aes(x=brands$credit)) + geom_histogram(bins=10) + theme_classic()

salary_hist

age_hist

credit_hist

Create bar charts for categorical variables

elevel_bar <- ggplot(brands, aes(x = elevel)) + geom_bar()

car_bar <- ggplot(brands, aes(x = car)) + geom_bar() 

zip_bar <- ggplot(brands, aes(x = zipcode)) + geom_bar()
  
brand_bar <- ggplot(brands, aes(x = brand)) + geom_bar()  

elevel_bar

car_bar

zip_bar

brand_bar 

Relations between the variables

Let us first see the relations between education level and salary_hist:

elevel_salary <- ggplot(brands, aes(x = elevel, y = salary)) +
    geom_boxplot()
elevel_salary

There is linear relationship between salary and education level. The lowest educated customers earn on average around 40000 and the highest educated almost 140000.

We can also plot variable brand by setting the color parameter:

elevel_salary_brand <- ggplot(brands, aes(x = elevel, y = salary, color = brand)) +
    geom_boxplot()
elevel_salary_brand

Relationship between age and brand is also interesting with younger customers preferably buying Elago and older customers Belkin

brand_age <- ggplot(brands, aes(x = brand, y = age)) +
    geom_boxplot()
brand_age

brand_salary <-ggplot(brands, aes(x = brand, y = salary)) +
    geom_boxplot()

brand_salary

This plot is telling us that customers who buy Belking earn in range between 60000 and 110000, whereas Elago buyers earn between 40000 and 125000.

brands %>% 
 ggplot(aes(x = salary, fill = elevel)) +
    geom_histogram(bins = 25) 

brands %>% 
ggplot(aes(x = elevel, fill = brand)) +
           geom_bar()

brands %>% 
ggplot(aes(x = elevel, fill = brand)) +
           geom_bar(position= "fill")

brands %>% 
ggplot( aes(x = elevel, y = brand)) +
           geom_jitter() + coord_flip()

brands %>%
  ggplot(aes(x=age, y = elevel, color=brand)) + geom_jitter() +coord_flip()

ggplot(brands, aes(salary, age)) + geom_point() + theme_classic() + geom_point(color = ifelse(brands$brand == "Belkin", 'red', 'black')) 

ggplot(brands, aes(salary, age)) + geom_point() + theme_classic() + geom_point(color = ifelse(brands$brand == "Elago", 'purple', 'black'))

##Correlation matrix and variable importance

Here we want to check correlation between our variables. First we create a table and a plot with correlations to our target brand

correlation_table(data=brands, target="brand")
##   Variable brand
## 1    brand  1.00
## 2   salary -0.02
## 3   credit -0.02
## 4      age -0.35
variable_importance <- var_rank_info(data=brands, target="brand")

ggplot(variable_importance, aes(x = reorder(var, gr), y = gr, fill = var)) + 
  geom_bar(stat = "identity") + 
  coord_flip() + 
  theme_bw() + 
  xlab("") + 
  ylab("Variable Importance 
       (based on Information Gain)"
       ) + 
  guides(fill = FALSE)

* en entropy measured in bits * mi mutual information * ig information gain * gr gain ratio

We see that the most important variable for predicting brand would be salary followed with credit, and then age and elevel. We want to check if salary and credit are highly corelated.

cor(brands$salary, brands$credit)
## [1] 0.8700001

It seems that they are highly corelated. We will have to omit credit variable from our model.

Let us continue with the funModeling library and make some cross plots. This plot intent to show in real scenarios if a variable is or not important, making a visual summary of it, (by grouping numerical variables into bins/groups).

brands$salary_discretized = 
  equal_freq(var=brands$salary, n_bins = 8)
summary(brands$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   52109   84969   84897  117168  150000
cross_plot(brands, input="salary_discretized", target="brand", auto_binning = FALSE)

cross_plot(brands, input="age", target="brand")
## Plotting transformed variable 'age' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'

cross_plot(brands, input="elevel", target="brand")